Spatial Structure of the Internet Traffic 



Marc Barthelemy 
CEA, Service de Physique de la Matiere Condensee
BP12 Bruyeres-Le-Chatel, France 

Bernard Gondran 
Reseau National de Telecommunications pour 
la Technologie, l'Enseignement et la Recherche
151, Bld de l'Hopital, 75013 Paris, France

Eric Guichard
Equipe Reseaux, Savoirs & Territoires
Ecole normale superieure, 75005 Paris, France 
(February 1, 2008) 

The Internet infrastructure is not virtual: its distribution is dictated by social, geographical,
economic, or political constraints. However, the infrastructure's design does not entirely determine
the information traffic, and different sources of complexity, such as the intrinsic heterogeneity of
the network or human practices, have to be taken into account. In order to manage the Internet's
expansion, plan new connections or optimize the existing ones, it is thus critical to understand
correlations between emergent global statistical patterns of Internet activity and human factors.
We analyze data from the French national 'Renater' network, which has about 2 million users and
consists of about 30 interconnected routers located in different regions of France, and we report
the following results. The Internet flow is strongly localized: most of the traffic takes place on a
'spanning' network connecting a small number of routers which can be classified either as 'active
centers' looking for information or 'databases' providing information. We also show that the Internet
activity of a region increases with the number of papers published by laboratories of that region,
demonstrating the positive impact of the Web on scientific activity and illustrating quantitatively
the adage 'the more you read, the more you write'.

PACS numbers: 02.50.-r, 05.45.Tp, 84.40.Ua, 87.23.Ge



I. INTRODUCTION 



The Internet connects different routers and servers using
different operating systems and transport protocols. This
intrinsic heterogeneity of the network, added to the
unpredictability of human practices [1], makes the Internet
inherently unreliable and its traffic complex [2]-[6].

There have recently been major advances in our
understanding of the generic aspects of the Internet [7]-[10]
and web structure and development. Concerning data
transport, most of the studies focus on properties at short
time scales or at the level of individual connections, while
studies on statistical flow properties at a large scale
concentrate essentially on the phase transition to a
congested regime. Despite these results, large scale studies
of traffic variations in time and space are still needed
before understanding the new social practices of Internet
users.

In this paper, we study the spatial structure of the
large scale flow. We present in part II the data studied
and in parts III and IV the results of our analysis, showing
the existence of a spanning network concentrating the
major part of the traffic. Finally, in part V we relate the
flow properties and their spatial distribution to scientific
activity measured by the number of published papers.


II. DATA STUDIED 

An important difficulty is obtaining real measurements of
Internet traffic on a global scale. The availability of data
from the French network 'Renater' allows us to consider
the cartography of Internet traffic and its relation with
regional socio-economic factors.

The French network 'Renater' has about 2 million users
and consists of a nation-wide infrastructure and of
international links. Most of the research, technological,
educational or cultural institutions are connected to
Renater (Fig. 1). This network enables them to communicate
with each other, to get access to public or private
world-wide research institutes and to be connected
to the global Internet.

We first restrict our analysis to the national traffic and
exclude the information exchange with external hosts and
routers such as the US and European Internet or peering with
other ISPs. This restriction to a small part of the Renater
traffic (about 5% of roughly 2000 gigabytes a day) has
two methodological advantages. First, it ensures that
the traffic studied is strictly professional (mail to
non-academics such as family and friends, consultation of
newspapers, e-commerce, etc. goes through outside ISPs and is
not taken into account). Second, it helps to understand
the regional traffic structure and its relation with local
economic factors. We believe that the global patterns
emerging for the Renater network will be relevant for
larger structures such as the global Internet.

The data consist of the real exchange flow (sum of Ftp,
Telnet, Mail, Web browsing, etc.) between all routers,
even if there is not a direct (physical) link between all
of them. For a connection $(i, j)$ between routers $i$ and $j$
($i \neq j$), $F_{ij}(t)$ (in bytes per 5 minutes) is the effective
information flow at time $t$ going out from $i$ to $j$. For
technical reasons, data for a few routers were not reliable,
and we analyzed data for 26 routers, which amounts to
$26 \times 25$ flows $F_{ij}(t)$, each given every $\Delta t = 5$ minutes
over a two-week period (the quantities $F_{ii}$ are excluded
from the present study).
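As a quick check of the size of this data set, the short Python sketch below counts the directed pairs and the samples per pair. It assumes that "two weeks" means exactly 14 full days of 5-minute samples with no gaps, which the text does not state explicitly.

```python
# Size of the data set, using the numbers quoted in the text.
n_routers = 26
minutes_per_sample = 5
days = 14  # assumption: "two weeks" = 14 complete days, no missing samples

directed_pairs = n_routers * (n_routers - 1)      # i != j, hence 26 * 25
samples_per_day = 24 * 60 // minutes_per_sample   # 5-minute bins in one day
n_samples = days * samples_per_day                # samples per directed pair
n_values = directed_pairs * n_samples             # total number of F_ij(t) values

print(directed_pairs, n_samples, n_values)
```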

As an example of the measured time series, we show
(Fig. 2) the information flow versus time between two
routers located in Grenoble and Marseille for a nine-day
period. One can see the different days and, within days,
bursts of intense activity. In this study, we focus on the
flow and not on the growth rate (and its correlations)
used in a previous study.

III. DATABASES VERSUS ACTIVE CENTERS 

We now present our empirical results. The time-averaged
incoming flow

$$F_{in}(j) = \sum_{i} \overline{F_{ij}} \tag{1}$$

at a given router $j$ is a measure of the Internet activity of
the corresponding region (the overbar denotes the average
over time). On the other hand, the average outgoing
flow

$$F_{out}(i) = \sum_{j} \overline{F_{ij}} \tag{2}$$

can be interpreted as the total request emanating from
other routers. It is thus a measure of the degree of
interest produced by this region.
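The two definitions can be illustrated with a minimal Python sketch. The array layout `F[t, i, j]` (time, source router, destination router), the function name `average_flows` and the random toy data are assumptions made for this illustration; they are not part of the Renater data format.

```python
import numpy as np

def average_flows(F):
    """Return the time-averaged matrix, F_in and F_out.

    F is assumed to have shape (T, N, N), with F[t, i, j] the flow
    going out from router i to router j during time bin t.
    """
    F_bar = F.mean(axis=0)           # average over time (the overbar)
    np.fill_diagonal(F_bar, 0.0)     # the quantities F_ii are excluded
    F_in = F_bar.sum(axis=0)         # sum over sources i: incoming flow at j
    F_out = F_bar.sum(axis=1)        # sum over destinations j: outgoing flow from i
    return F_bar, F_in, F_out

# Toy data: 12 time bins, 4 routers (made-up numbers, for illustration only).
rng = np.random.default_rng(0)
F = rng.integers(0, 1000, size=(12, 4, 4))
F_bar, F_in, F_out = average_flows(F)
print(F_in, F_out)
```

Since every byte leaving one router arrives at another, the two vectors have the same total, which gives a simple consistency check on any implementation.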

We plotted both quantities $F_{in}$ and $F_{out}$ versus their rank
(Fig. 3a, b). In contrast with many cases observed since
the work of Zipf, the observed distributions are not
power laws but exponentials. This might be the signature
of a transient regime and would mean that the Internet
has not reached its stationary state, but it is more
probably the sign that the Internet traffic has a unique,
non-hierarchical structure. This exponential behavior
also means that, at least in the Renater network, there
are essentially two categories of regions. Considering
Internet activity (Fig. 3a), one can distinguish active from
(almost) inactive regions. Roughly, about eight cities
receive 80% of the total traffic, the rest being
(exponentially) negligible. Concerning the outgoing
flow (Fig. 3b), there are about five most-visited regions,
the rest being comparatively 'unattractive'. We checked
that the order of these cities can change slightly for
different time windows, but the exponential behavior is
independent of these seasonal effects.
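A rank plot of this kind can be fitted with a straight line in lin-log coordinates. The sketch below is only an illustration: the function name `fit_rank_exponential` and the synthetic flows are hypothetical, and an ordinary least-squares fit of $\log F$ against the rank is assumed, which is not necessarily the procedure used for Fig. 3.

```python
import numpy as np

def fit_rank_exponential(flows):
    """Fit F(r) = A * exp(-r / r0) to flows ranked in decreasing order.

    Returns (r0, A). All flows must be strictly positive.
    """
    ranked = np.sort(np.asarray(flows, dtype=float))[::-1]
    ranks = np.arange(1, len(ranked) + 1)
    # Straight-line fit of log F versus rank: the slope is -1 / r0.
    slope, intercept = np.polyfit(ranks, np.log(ranked), 1)
    return -1.0 / slope, np.exp(intercept)

# Synthetic example: 26 regions with a characteristic rank of 8 (made-up data).
synthetic = 1e6 * np.exp(-np.arange(1, 27) / 8.0)
r0, amplitude = fit_rank_exponential(synthetic)
print(r0, amplitude)
```

The fitted characteristic rank `r0` plays the role of the boundary between the two groups of regions: ranks below it are 'active', ranks above it are exponentially suppressed.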

It is interesting to note that the most active and the most
visited regions are not the same, showing that each region
has its specific activity. Regions with a large incoming flow
can be classified as active research centers with a great need
of information, and regions with a large outgoing flow
correspond to important information resources such as
databases or libraries.

At this stage, we have shown that in the Renater traffic
there is a small number of receivers (located in active
regions) and emitters (visited databases). However, a
further question concerns the secondary routers and the
fine structure of flows. Indeed, $F_{in}$ (and similarly
$F_{out}$) could be a sum of many small contributions coming
from many regions, or in contrast there could be only a
few regions which exchange a significant flow. Simple
quantities which can characterize the fine structure of
the incoming flow at router $i$ are the $Y_k$'s, introduced in
another context:

$$Y_k(i) = \sum_{j=1}^{N} (w_{ji})^k \tag{3}$$

where $w_{ji} = F_{ji}/F_{in}(i)$ is the weight associated with the
incoming flow $F_{ji}$ (and similar expressions for the
structure of the outgoing flow). It is easy to see that $Y_0 = N$,
$Y_1 = 1$, and that the first non-trivial quantity is $Y_2$. We can
illustrate the physical meaning of $Y_2$ with simple examples.
If all weights are of the same order, $w_{ji} \sim 1/N$ for
all $i, j$, then $Y_2 \sim 1/N$ is very small. In contrast, if one
weight is important, for example of order $1/2$, and
the others are negligible, $\sim 1/[2(N-1)]$, then $Y_2 \sim 1/4$ is of
order unity. Thus $1/Y_2$ is a measure of the number of
important weights. We plot $Y_2$ for both the incoming and
outgoing flows (the statistics are over two weeks). The
result (Fig. 4) shows clearly that the most probable value is
$1/4$ and that $Y_2$ is larger than $1/N \sim 1/30 \sim 0.03$ (except
for a few cases which appear in the histogram). This
confirms that a few routers exchange most of
the information, the rest of the network being negligible.
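The two limiting cases discussed above can be checked numerically. The following Python sketch is illustrative only: the function name `disparity` and the choice $N = 25$ (the other routers seen by one of the 26 routers) are assumptions made for the example.

```python
import numpy as np

def disparity(flows, k=2):
    """Y_k = sum_j (w_j)^k, with weights w_j = flow_j / sum(flows)."""
    flows = np.asarray(flows, dtype=float)
    weights = flows / flows.sum()
    return float(np.sum(weights ** k))

N = 25  # assumption: contributions from the 25 other routers

# Case 1: all contributions equal, so every weight is 1/N.
uniform = np.ones(N)

# Case 2: one weight equal to 1/2, the others sharing the remaining half.
dominated = np.array([0.5] + [0.5 / (N - 1)] * (N - 1))

y2_uniform = disparity(uniform)        # expected 1/N
y2_dominated = disparity(dominated)    # expected 1/4 + 1/(4 (N - 1))
print(y2_uniform, y2_dominated, 1.0 / y2_dominated)
```

In the second case the exact value is $1/4 + 1/[4(N-1)]$, which tends to $1/4$ for large $N$, in line with the estimate given in the text.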

IV. SPANNING NETWORK 

In order to illustrate the above results, we construct
the network $S_k$ connecting a number $k$ of routers and
carrying the maximal flow, denoted by $F_{tot}(S_k)$. We
increase $k$ from 2 to 26 and we obtain the result plotted in
Fig. 5a. It appears clearly that a small fraction of links
carry most of the flow. This behavior is encoded in the
fact that $F_{tot}$ is a power law of $p$, the fraction of all
possible connections that belong to $S_k$ (see the caption
of Fig. 5), for $p \to 0$:

$$F_{tot}(S_k) \sim p^{\theta} \tag{4}$$

with an exponent smaller than one ($\theta \simeq 0.6$), so that a
small variation of the number of connections leads to a
large variation of the transported flow.
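The text does not specify how the maximal-flow network $S_k$ is found, so the Python sketch below uses a simple greedy heuristic as a stand-in: start from the pair of routers exchanging the largest flow, then repeatedly add the router that brings the most additional flow. The function name `greedy_flow_curve`, the symmetrized pair flow, the definition $p = k(k-1)/[N(N-1)]$ and the random toy matrix are all assumptions of this illustration, and a greedy search is not guaranteed to find the true maximum.

```python
import numpy as np

def greedy_flow_curve(F_bar):
    """Greedy approximation of the flow carried by S_k for k = 2..N.

    F_bar[i, j] is the time-averaged flow from router i to router j.
    Returns a list of (k, p, fraction of the total flow carried).
    """
    F_bar = np.asarray(F_bar, dtype=float)
    N = F_bar.shape[0]
    W = F_bar + F_bar.T              # flow exchanged by a pair, both directions
    np.fill_diagonal(W, 0.0)
    total = W.sum() / 2.0            # each pair is counted twice in W
    i, j = np.unravel_index(np.argmax(W), W.shape)
    members = [int(i), int(j)]
    carried = W[i, j]
    curve = []
    while True:
        k = len(members)
        p = k * (k - 1) / (N * (N - 1))   # fraction of possible connections
        curve.append((k, p, float(carried / total)))
        if k == N:
            break
        rest = [r for r in range(N) if r not in members]
        # Flow that each remaining router exchanges with the current members.
        gains = W[np.ix_(rest, members)].sum(axis=1)
        best = int(np.argmax(gains))
        carried += gains[best]
        members.append(rest[best])
    return curve

# Toy example with 6 routers and random flows (made-up data).
rng = np.random.default_rng(1)
F_toy = rng.random((6, 6))
np.fill_diagonal(F_toy, 0.0)
curve = greedy_flow_curve(F_toy)
for k, p, fraction in curve:
    print(k, round(p, 3), round(fraction, 3))
```

An exponent such as $\theta$ could then be estimated by a straight-line fit of the logarithm of the carried flow against $\log p$ over the small-$p$ part of this curve.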

This analysis completes the Renater traffic map: there
is a small number of receivers and emitters exchanging
significant information between them, the rest of the
network being exponentially negligible. This demonstrates
the existence of a 'spanning network' carrying most of
the traffic and connecting the main emitters to the main
receivers.

In order to visualize this spanning network on the
French map, we filter the flow with the following
procedure. We first select flows above a certain threshold
$f_c$, and then we select a connection $(i, j)$ only if the
corresponding flow $F_{ij}$ represents a large percentage of (i)
the outgoing flow from $i$ and (ii) the incoming flow in
$j$. The result is shown in Fig. 5b. We checked that the
instantaneous (or averaged over a different time window)
spanning network may be different, but always with
the same characteristics (small number of interconnected
emitters and receivers). The procedure described above
could thus be used as a simple filter in order to visualize
complex flow matrices in real time.
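The filtering procedure can be sketched in a few lines of Python. The text does not quantify what "a large percentage" means, so the parameter `min_share` below, like the function name `filter_connections` and the toy matrix, is a hypothetical choice made for this illustration.

```python
import numpy as np

def filter_connections(F_bar, f_c, min_share):
    """Keep connection (i, j) if F_ij > f_c and F_ij is at least
    min_share of both the outgoing flow of i and the incoming flow of j."""
    F = np.array(F_bar, dtype=float)
    np.fill_diagonal(F, 0.0)
    out_total = F.sum(axis=1, keepdims=True)   # outgoing flow of each source i
    in_total = F.sum(axis=0, keepdims=True)    # incoming flow of each destination j
    with np.errstate(divide="ignore", invalid="ignore"):
        share_out = np.where(out_total > 0, F / out_total, 0.0)
        share_in = np.where(in_total > 0, F / in_total, 0.0)
    keep = (F > f_c) & (share_out >= min_share) & (share_in >= min_share)
    return [(int(i), int(j)) for i, j in np.argwhere(keep)]

# Toy flow matrix for three routers (made-up numbers).
F_toy = [[0, 90, 5],
         [10, 0, 5],
         [5, 10, 0]]
selected = filter_connections(F_toy, f_c=20, min_share=0.5)
print(selected)
```

In this toy case only the connection from router 0 to router 1 survives: it exceeds the threshold and dominates both the outgoing flow of router 0 and the incoming flow of router 1.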

V. DIGRESSION TO SCIENTOMETRICS: 
INTERNET TRAFFIC VERSUS SCIENTIFIC 
ACTIVITY 

So far, we have studied statistical properties of the
traffic, but an important point is to relate them to
economic or social factors. The Internet activity should in
principle be related to social indicators such as the
number of inhabitants, the number of students, and so on.
From our data, the indicator which shows the best
correlation with Internet activity is scientific activity
measured by the number of published papers. One can
expect that the more a scientist consults books or data,
the more he/she will publish. This principle, 'the more
you read, the more you write', although commonly
accepted in a number of historical cases, is difficult
to evaluate quantitatively, the main difficulty being the
measure of the amount of information gathered by
scientists in libraries. In the case of the Internet, the
information gathered by scientists working in a given region
can be estimated by the average incoming flow in the
corresponding router. Since the information needed by a
scientist is usually scattered world-wide, it is important
here to take into account the total incoming flow,
including exchanges with international hosts. We thus compare
the total average incoming flow (per scientist) with the
average number of papers published (per scientist) per
year by the region's universities (obtained from the SCI
database). As a representative panel, we choose to use
data about papers published only by scientists in the
national research institution (CNRS). We represent
these data in Fig. 6. This plot shows that the average
incoming flow per scientist $I$ in a region increases with
the number of published scientific papers per scientist $p$
by this region's laboratories, roughly as a power law

$$I \sim p^{\beta} \tag{5}$$

with exponent $\beta \simeq 1.1 \pm 0.1$. This result confirms
quantitatively the intuitive principle stated above and is
particularly interesting from the point of view of the Web's
social impact. Indeed, it implies that the number of
publications grows with the incoming flow as a power law
with exponent $1/\beta \simeq 1$: the more one uses the Internet, the
more one publishes! This result indicates that on
average the use of the Internet has a positive impact on research
productivity.
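A power-law exponent of this kind is commonly estimated by a least-squares straight line in log-log coordinates. The Python sketch below shows the idea on synthetic data: the function name `fit_power_law` and the numbers are made up for illustration and are not the Renater or CNRS values.

```python
import numpy as np

def fit_power_law(p, flow):
    """Least-squares fit of flow = a * p**beta in log-log coordinates."""
    beta, log_a = np.polyfit(np.log(p), np.log(flow), 1)
    return beta, np.exp(log_a)

# Synthetic regions: papers per scientist and incoming flow per scientist,
# generated exactly from a power law with exponent 1.1 (made-up data).
papers = np.array([0.5, 1.0, 2.0, 4.0])
flow = 3.0 * papers ** 1.1

beta, prefactor = fit_power_law(papers, flow)
inverse_exponent = 1.0 / beta   # exponent of papers as a function of flow
print(beta, prefactor, inverse_exponent)
```

Inverting the relation gives $p \sim I^{1/\beta}$, which is why an exponent $\beta$ close to one means that publications grow almost linearly with the incoming flow.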

VI. CONCLUSION 

In summary, we have shown that the major part of
the traffic takes place only between a few routers while
the rest of the network is almost negligible. We have
proposed a simple procedure to extract this (bipartite)
spanning network, which could have some implications for
visualization and monitoring of real-time traffic. In
addition, resource allocation and capacity planning tasks
could benefit from the knowledge of such a spanning
network. These results point towards new ways of
understanding and describing real-world traffic. In particular,
any microscopic model should recover these statistical
properties, and our results provide a quantitative basis
for modeling the dynamics of information flow.

We have also shown that the scientific activity of a
region increases with its Internet activity. This
indicates that it is difficult for a scientist to avoid the use of
the Internet without affecting his/her productivity measured
in terms of publications. This result also demonstrates
that, in addition to increasing people's social capital,
the Internet has a measurable positive impact on research
production. More generally, it underlines the importance
of the Internet as a knowledge-sharing vector. This study also
suggests that Internet activity could be used as an
interesting new socio-economic indicator well adapted to
the information society.

Finally, these results exhibit some global statistical
patterns shedding light on the relations between the
Internet and economic factors. They show that in addition
to the structural complexity of the web and the Internet,
the traffic has its own complexity with its own
cartography.

FIG. 1. The Renater Network. There is a total of
about 30 interconnected main routers. To each router
corresponds a region comprising a main city. The data
consist of a flow matrix $F_{ij}(t)$ which gives the effective flows
(virtual or physical) on the connection between routers $i$
and $j$. For more details on this network, see the web page
http://www.renater.fr and for an animated version of flows,
see http://barthes.ens.fr/metrologie/Renater01 .




FIG. 2. Lin-Log plot of the information flow versus time
from Grenoble to Marseille over a nine-day period. We represent
the raw data: the number of bytes (per 5 minutes) exchanged
between these two cities. One can see the different days and,
within days, bursts of intense activity.

FIG. 3. Exponential disparities between regions. Lin-Log
Zipf (rank) plot of the Internet activity (in bytes/5 minutes):
(a) incoming flow $F_{in}$ and (b) outgoing flow $F_{out}$, averaged over two
weeks. The relation is an exponential (indicated by solid lines)
$F_{in(out)} \propto \exp(-r/r_{in(out)})$ with $r_{in} \simeq 8$ in case (a) and
$r_{out} \simeq 5$ in case (b). This result allows one to separate
regions into two distinct groups: 'active' ($r < r_{in(out)}$) and
'inactive' ones ($r > r_{in(out)}$). We indicate in each case the
first four cities. This order can change from one period to
another, but the exponential behavior still holds.

FIG. 4. Fine structure of flows. We represent the
probability distribution of $Y_2$ for both incoming and outgoing flows.
A small value of $Y_2$ corresponds to a very 'fragmented' flow,
while a large value means that there are only one or two
important contributions, the rest being negligible. The arrow
indicates the value $1/N \simeq 0.03$, which corresponds to flows to
which each router contributes. The distributions are peaked
at $Y_2 \simeq 0.2$ and are concentrated in the range 0.2 to 1. This
indicates that essentially a few routers contribute to each flow.

FIG. 5. The spanning network. (a) $F_{tot}(S_k)$ is the total
flow carried over the network $S_k$ which (i) connects a number
$k$ of routers and (ii) carries the maximal flow. The quantity $p$
is the corresponding number of connections between $k$ routers
over the total number of possible connections in the whole
network. In the inset, we show the same plot in Log-Log, showing
that for $p \to 0$ the flow grows as a power law with exponent
$\theta \simeq 0.6$. (b) We apply the filtering procedure explained in the
text and we obtain the spanning network. In this example, it
consists of 14 connections between 11 routers (which is
2% of the total number of possible connections) and carries 30%
of the total flow. On this map, the width of a connection is
a slowly increasing function of the volume of flow passing through it.

FIG. 6. Internet activity versus scientific research.
Log-Log plot of the
incoming flow per scientist in a region versus the average number
of scientific papers per scientist published (per year) by
scientists working in that region. The solid line is a least-squares
estimate with a slope of order $1.1 \pm 0.1$. The correlation
coefficient is $C = 0.80$, while $F \simeq 41$ for one degree of freedom.
This plot shows that scientific productivity increases with
Internet activity.